Objective: This is an introductory R walkthrough. We create a mock data set and then work through various ways to clean it, analyze the data, create some visualizations, and finally map entries from their latitude and longitude coordinates.
We start by creating DF_Cats, a mock of what a raw data
set might look like. Below we’ll go through some useful general steps for
cleaning, analyzing, and visualizing results using R.
DF_Cats <- data.frame(
A = c("Luna","Willow","Name","Miso","Coco","Milo","Bubbles","Socks","Oreo","Fudge", "Muffin",NA, "Muffin"),
B = c("orange","black","Color","gray","white","gray","brown","brown","black","black","white",NA,"white"),
C = c(10,9,"Avg. Weight (whole #s)",10,12,10.76,NA,8.1,11,11.876,NA,NA,NA),
D = c("Denver","Fort Collins","Location","Boulder","Colorado Springs","Denver","Denver","Boulder","Denver","Boulder","Colorado Springs",NA,"Colorado Springs"),
E = c(4.8,4.5,"Age (round quarter %)",6.3,NA,3.75,10.25,8.5,6,NA,7.25,NA,7.25),
G = c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA),
H = c(TRUE,TRUE,"Indoor",TRUE,TRUE,TRUE,TRUE,TRUE,TRUE,TRUE,TRUE,NA,TRUE),
I = c("01/14/2025","1/13/2025","Date Surveyed",45636,"12/11/2024","Jan/10/2025","1/6/2025","1/05/2025","08/01/2025","1/8/2025","12/5/2024",NA,"12/5/2024"),
J = c(39.75115289884762,40.54987369801155,"Lat",40.01748117832861,38.80311438445151,39.74201207575289,39.716228422690236,40.00131035581603,39.74831937335296,40.02695393323672,38.80967904193633,NA,38.80967904193633),
K = c(-104.99706281032537,-105.0885319787086,"Lon",-105.27776267753427,-104.76783518791748,-104.95473978061132,-104.98374769984966,-105.27534451548036,-104.97590129536343,-105.27441445315193,-104.77443409220167,NA,-104.77443409220167)
)
| A | B | C | D | E | G | H | I | J | K |
|---|---|---|---|---|---|---|---|---|---|
| Luna | orange | 10 | Denver | 4.8 | NA | TRUE | 01/14/2025 | 39.7511528988476 | -104.997062810325 |
| Willow | black | 9 | Fort Collins | 4.5 | NA | TRUE | 1/13/2025 | 40.5498736980116 | -105.088531978709 |
| Name | Color | Avg. Weight (whole #s) | Location | Age (round quarter %) | NA | Indoor | Date Surveyed | Lat | Lon |
| Miso | gray | 10 | Boulder | 6.3 | NA | TRUE | 45636 | 40.0174811783286 | -105.277762677534 |
| Coco | white | 12 | Colorado Springs | NA | NA | TRUE | 12/11/2024 | 38.8031143844515 | -104.767835187917 |
| Milo | gray | 10.76 | Denver | 3.75 | NA | TRUE | Jan/10/2025 | 39.7420120757529 | -104.954739780611 |
| Bubbles | brown | NA | Denver | 10.25 | NA | TRUE | 1/6/2025 | 39.7162284226902 | -104.98374769985 |
| Socks | brown | 8.1 | Boulder | 8.5 | NA | TRUE | 1/05/2025 | 40.001310355816 | -105.27534451548 |
| Oreo | black | 11 | Denver | 6 | NA | TRUE | 08/01/2025 | 39.748319373353 | -104.975901295363 |
| Fudge | black | 11.876 | Boulder | NA | NA | TRUE | 1/8/2025 | 40.0269539332367 | -105.274414453152 |
| Muffin | white | NA | Colorado Springs | 7.25 | NA | TRUE | 12/5/2024 | 38.8096790419363 | -104.774434092202 |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Muffin | white | NA | Colorado Springs | 7.25 | NA | TRUE | 12/5/2024 | 38.8096790419363 | -104.774434092202 |
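Because the real header text is sitting in a data row, R imports every column as character, even the weights and coordinates. A quick base R check of this, sketched on a hypothetical two-column frame mimicking DF_Cats:

```r
# Hypothetical mini frame: header text mixed into the data rows
df <- data.frame(A = c("Luna", "Name", "Miso"),
                 C = c("10", "Avg. Weight (whole #s)", "10"))
sapply(df, class)  # both columns come back as "character"
```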
Notice that DF_Cats has the column names stored in the
third row. row_to_names() makes the specified row the
names of the columns.
#library(dplyr) - Needed for piping ( %>% ) operations
#library(janitor)
DF_Cats = DF_Cats %>% row_to_names(row_number=3, remove_row = TRUE, remove_rows_above = FALSE)
#Here row_to_names() used the specified row number 3 to assign names to the columns. The third row was afterwards removed (default) and the rows above (rows one and two) were not removed.
| Name | Color | Avg. Weight (whole #s) | Location | Age (round quarter %) | NA | Indoor | Date Surveyed | Lat | Lon | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Luna | orange | 10 | Denver | 4.8 | NA | TRUE | 01/14/2025 | 39.7511528988476 | -104.997062810325 |
| 2 | Willow | black | 9 | Fort Collins | 4.5 | NA | TRUE | 1/13/2025 | 40.5498736980116 | -105.088531978709 |
| 4 | Miso | gray | 10 | Boulder | 6.3 | NA | TRUE | 45636 | 40.0174811783286 | -105.277762677534 |
| 5 | Coco | white | 12 | Colorado Springs | NA | NA | TRUE | 12/11/2024 | 38.8031143844515 | -104.767835187917 |
| 6 | Milo | gray | 10.76 | Denver | 3.75 | NA | TRUE | Jan/10/2025 | 39.7420120757529 | -104.954739780611 |
| 7 | Bubbles | brown | NA | Denver | 10.25 | NA | TRUE | 1/6/2025 | 39.7162284226902 | -104.98374769985 |
| 8 | Socks | brown | 8.1 | Boulder | 8.5 | NA | TRUE | 1/05/2025 | 40.001310355816 | -105.27534451548 |
| 9 | Oreo | black | 11 | Denver | 6 | NA | TRUE | 08/01/2025 | 39.748319373353 | -104.975901295363 |
| 10 | Fudge | black | 11.876 | Boulder | NA | NA | TRUE | 1/8/2025 | 40.0269539332367 | -105.274414453152 |
| 11 | Muffin | white | NA | Colorado Springs | 7.25 | NA | TRUE | 12/5/2024 | 38.8096790419363 | -104.774434092202 |
| 12 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 13 | Muffin | white | NA | Colorado Springs | 7.25 | NA | TRUE | 12/5/2024 | 38.8096790419363 | -104.774434092202 |
Now we can use clean_names() to convert the column names into
something that’s easy to work with in code.
DF_Cats = DF_Cats %>% clean_names()
#Notice how parentheses and blank spaces are converted to _ , % is replaced with percent, # is replaced with number, and all letters are now lower case.
| name | color | avg_weight_whole_number_s | location | age_round_quarter_percent | na | indoor | date_surveyed | lat | lon | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Luna | orange | 10 | Denver | 4.8 | NA | TRUE | 01/14/2025 | 39.7511528988476 | -104.997062810325 |
| 2 | Willow | black | 9 | Fort Collins | 4.5 | NA | TRUE | 1/13/2025 | 40.5498736980116 | -105.088531978709 |
| 4 | Miso | gray | 10 | Boulder | 6.3 | NA | TRUE | 45636 | 40.0174811783286 | -105.277762677534 |
| 5 | Coco | white | 12 | Colorado Springs | NA | NA | TRUE | 12/11/2024 | 38.8031143844515 | -104.767835187917 |
| 6 | Milo | gray | 10.76 | Denver | 3.75 | NA | TRUE | Jan/10/2025 | 39.7420120757529 | -104.954739780611 |
| 7 | Bubbles | brown | NA | Denver | 10.25 | NA | TRUE | 1/6/2025 | 39.7162284226902 | -104.98374769985 |
| 8 | Socks | brown | 8.1 | Boulder | 8.5 | NA | TRUE | 1/05/2025 | 40.001310355816 | -105.27534451548 |
| 9 | Oreo | black | 11 | Denver | 6 | NA | TRUE | 08/01/2025 | 39.748319373353 | -104.975901295363 |
| 10 | Fudge | black | 11.876 | Boulder | NA | NA | TRUE | 1/8/2025 | 40.0269539332367 | -105.274414453152 |
| 11 | Muffin | white | NA | Colorado Springs | 7.25 | NA | TRUE | 12/5/2024 | 38.8096790419363 | -104.774434092202 |
| 12 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 13 | Muffin | white | NA | Colorado Springs | 7.25 | NA | TRUE | 12/5/2024 | 38.8096790419363 | -104.774434092202 |
We use remove_empty() to get rid of rows and
columns that are entirely empty (the default removes both) and
remove_constant() to get rid of columns holding a single constant value.
DF_Cats = DF_Cats %>%
remove_empty() %>% #removes both rows and columns that are empty
remove_constant() #drops columns with a single constant value
#We had a column and a row with all NA values that were removed. Our "Indoor" column was also removed, given that all values were TRUE and it didn't add information.
| name | color | avg_weight_whole_number_s | location | age_round_quarter_percent | date_surveyed | lat | lon | |
|---|---|---|---|---|---|---|---|---|
| 1 | Luna | orange | 10 | Denver | 4.8 | 01/14/2025 | 39.7511528988476 | -104.997062810325 |
| 2 | Willow | black | 9 | Fort Collins | 4.5 | 1/13/2025 | 40.5498736980116 | -105.088531978709 |
| 4 | Miso | gray | 10 | Boulder | 6.3 | 45636 | 40.0174811783286 | -105.277762677534 |
| 5 | Coco | white | 12 | Colorado Springs | NA | 12/11/2024 | 38.8031143844515 | -104.767835187917 |
| 6 | Milo | gray | 10.76 | Denver | 3.75 | Jan/10/2025 | 39.7420120757529 | -104.954739780611 |
| 7 | Bubbles | brown | NA | Denver | 10.25 | 1/6/2025 | 39.7162284226902 | -104.98374769985 |
| 8 | Socks | brown | 8.1 | Boulder | 8.5 | 1/05/2025 | 40.001310355816 | -105.27534451548 |
| 9 | Oreo | black | 11 | Denver | 6 | 08/01/2025 | 39.748319373353 | -104.975901295363 |
| 10 | Fudge | black | 11.876 | Boulder | NA | 1/8/2025 | 40.0269539332367 | -105.274414453152 |
| 11 | Muffin | white | NA | Colorado Springs | 7.25 | 12/5/2024 | 38.8096790419363 | -104.774434092202 |
| 13 | Muffin | white | NA | Colorado Springs | 7.25 | 12/5/2024 | 38.8096790419363 | -104.774434092202 |
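If you only want to drop empty columns (or only empty rows), remove_empty() also takes a which argument, e.g. remove_empty(DF_Cats, which = "cols"). A base R sketch of the column case, on a hypothetical frame:

```r
# Hypothetical frame: column b is entirely NA
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, NA))
# Keep only columns that contain at least one non-NA value
df[, colSums(is.na(df)) < nrow(df), drop = FALSE]
```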
With get_dupes() you specify the variable combination to
search for duplicates, and it returns the duplicated rows along with a
count of how many rows share that combination of values.
DF_Cats %>% get_dupes() #examining duplicates
## name color avg_weight_whole_number_s location
## 1 Muffin white <NA> Colorado Springs
## 2 Muffin white <NA> Colorado Springs
## age_round_quarter_percent date_surveyed lat lon
## 1 7.25 12/5/2024 38.8096790419363 -104.774434092202
## 2 7.25 12/5/2024 38.8096790419363 -104.774434092202
## dupe_count
## 1 2
## 2 2
DF_Cats = DF_Cats[-11,] #removing duplicate row
#In this case we examine the last row with "Muffin" as the name and decide it's a duplicate entry due perhaps to input error. It is therefore removed.
| name | color | avg_weight_whole_number_s | location | age_round_quarter_percent | date_surveyed | lat | lon | |
|---|---|---|---|---|---|---|---|---|
| 1 | Luna | orange | 10 | Denver | 4.8 | 01/14/2025 | 39.7511528988476 | -104.997062810325 |
| 2 | Willow | black | 9 | Fort Collins | 4.5 | 1/13/2025 | 40.5498736980116 | -105.088531978709 |
| 4 | Miso | gray | 10 | Boulder | 6.3 | 45636 | 40.0174811783286 | -105.277762677534 |
| 5 | Coco | white | 12 | Colorado Springs | NA | 12/11/2024 | 38.8031143844515 | -104.767835187917 |
| 6 | Milo | gray | 10.76 | Denver | 3.75 | Jan/10/2025 | 39.7420120757529 | -104.954739780611 |
| 7 | Bubbles | brown | NA | Denver | 10.25 | 1/6/2025 | 39.7162284226902 | -104.98374769985 |
| 8 | Socks | brown | 8.1 | Boulder | 8.5 | 1/05/2025 | 40.001310355816 | -105.27534451548 |
| 9 | Oreo | black | 11 | Denver | 6 | 08/01/2025 | 39.748319373353 | -104.975901295363 |
| 10 | Fudge | black | 11.876 | Boulder | NA | 1/8/2025 | 40.0269539332367 | -105.274414453152 |
| 11 | Muffin | white | NA | Colorado Springs | 7.25 | 12/5/2024 | 38.8096790419363 | -104.774434092202 |
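Removing a duplicate by row index works here, but it breaks if the data are ever reordered. A more robust sketch uses base R’s duplicated() (or, with dplyr loaded, distinct()), shown on a hypothetical frame:

```r
# Hypothetical frame with one exact duplicate row
df <- data.frame(name  = c("Coco", "Muffin", "Muffin"),
                 color = c("white", "white", "white"))
# Base R: keep only the first occurrence of each fully duplicated row
df[!duplicated(df), ]
# With dplyr loaded, the equivalent is df %>% distinct()
```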
Dates can be tricky, coming in many different formats. In our
date_surveyed column we notice dates in several formats.
Some months are represented with a single digit, others with two, and
others with letters (01 vs. 1 vs. Jan). When importing from Excel we
might also receive dates as serial numbers such as 45636. And
finally, one of our entries appears to be in day/month/year format
(08/01/2025) instead of month/day/year. That convention is common
in other countries and is worth checking for when working with multiple
data sets or when screening for entry errors.
First we’ll convert the numeric value (45636) to
a Date using convert_to_date() from the
janitor package. convert_to_date()
recognizes Excel serial-number dates stored as numeric or
character values and converts them to the Date
class.
Second we’ll use the mdy() and dmy()
functions from the lubridate package to change
month/day/year and day/month/year
entries to a Date class.
# library(lubridate)
DF_Cats$date_surveyed_ = as.Date(NA)
#Create a new column of class Date (filled with NA) to populate with our converted dates
DF_Cats$date_surveyed_[3] = convert_to_date(DF_Cats$date_surveyed[3])
#converting our numeric entry (45636) to a date class
DF_Cats$date_surveyed_[8] = dmy(DF_Cats$date_surveyed[8])
#converting our day/month/year entry to a date class
DF_Cats$date_surveyed_[-c(3,8)] = mdy(DF_Cats$date_surveyed[-c(3,8)])
#converting the rest of the month/day/year entries to Date format
DF_Cats = DF_Cats[,-6]
#Removing our original raw dates column
| name | color | avg_weight_whole_number_s | location | age_round_quarter_percent | lat | lon | date_surveyed_ | |
|---|---|---|---|---|---|---|---|---|
| 1 | Luna | orange | 10 | Denver | 4.8 | 39.7511528988476 | -104.997062810325 | 2025-01-14 |
| 2 | Willow | black | 9 | Fort Collins | 4.5 | 40.5498736980116 | -105.088531978709 | 2025-01-13 |
| 4 | Miso | gray | 10 | Boulder | 6.3 | 40.0174811783286 | -105.277762677534 | 2024-12-10 |
| 5 | Coco | white | 12 | Colorado Springs | NA | 38.8031143844515 | -104.767835187917 | 2024-12-11 |
| 6 | Milo | gray | 10.76 | Denver | 3.75 | 39.7420120757529 | -104.954739780611 | 2025-01-10 |
| 7 | Bubbles | brown | NA | Denver | 10.25 | 39.7162284226902 | -104.98374769985 | 2025-01-06 |
| 8 | Socks | brown | 8.1 | Boulder | 8.5 | 40.001310355816 | -105.27534451548 | 2025-01-05 |
| 9 | Oreo | black | 11 | Denver | 6 | 39.748319373353 | -104.975901295363 | 2025-01-08 |
| 10 | Fudge | black | 11.876 | Boulder | NA | 40.0269539332367 | -105.274414453152 | 2025-01-08 |
| 11 | Muffin | white | NA | Colorado Springs | 7.25 | 38.8096790419363 | -104.774434092202 | 2024-12-05 |
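When the formats vary only in separators or month spelling, lubridate::parse_date_time() can try several formats in one pass. A sketch on a hypothetical vector mirroring the month/day/year variants in date_surveyed (note that genuinely ambiguous mm/dd vs. dd/mm entries still need the manual handling shown above, since parse_date_time() resolves them to the first matching order):

```r
# Hypothetical month/day/year dates with mixed digit counts and a named month
raw_dates <- c("01/14/2025", "1/13/2025", "Jan/10/2025", "12/5/2024")
# All parse with a single "mdy" order
as.Date(lubridate::parse_date_time(raw_dates, orders = "mdy"))
```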
In our avg_weight_whole_number_s column we notice that
our values are supposed to be only whole numbers, but not all of them
are. round_half_up() from janitor rounds
values to the nearest whole number, always rounding halves up. Compare this to
round(), where R by default uses “banker’s rounding” and
rounds halves to the nearest even number.
Our other column, age_round_quarter_percent, records age
in years, but should only contain quarter-year values. round_to_fraction()
rounds each value to the nearest multiple of one over the specified denominator.
#Rounding comparison example below
#values from .5 to 4.5, increasing by 1, are rounded by R default
round(seq(0.5, 4.5, 1))
## [1] 0 2 2 4 4
#values rounded by round_half_up() from janitor
round_half_up(seq(0.5, 4.5, 1))
## [1] 1 2 3 4 5
DF_Cats$avg_weight_whole_number_s = round_half_up(as.numeric(DF_Cats$avg_weight_whole_number_s))
#as.numeric() is needed because our column "avg_weight_whole_number_s" is still of class character: every column imported as character because the header text was originally stored in a data row.
DF_Cats$age_round_quarter_percent = round_to_fraction(as.numeric(DF_Cats$age_round_quarter_percent), denominator = 4)
#rounding all age values to quarters
| name | color | avg_weight_whole_number_s | location | age_round_quarter_percent | lat | lon | date_surveyed_ | |
|---|---|---|---|---|---|---|---|---|
| 1 | Luna | orange | 10 | Denver | 4.75 | 39.7511528988476 | -104.997062810325 | 2025-01-14 |
| 2 | Willow | black | 9 | Fort Collins | 4.50 | 40.5498736980116 | -105.088531978709 | 2025-01-13 |
| 4 | Miso | gray | 10 | Boulder | 6.25 | 40.0174811783286 | -105.277762677534 | 2024-12-10 |
| 5 | Coco | white | 12 | Colorado Springs | NA | 38.8031143844515 | -104.767835187917 | 2024-12-11 |
| 6 | Milo | gray | 11 | Denver | 3.75 | 39.7420120757529 | -104.954739780611 | 2025-01-10 |
| 7 | Bubbles | brown | NA | Denver | 10.25 | 39.7162284226902 | -104.98374769985 | 2025-01-06 |
| 8 | Socks | brown | 8 | Boulder | 8.50 | 40.001310355816 | -105.27534451548 | 2025-01-05 |
| 9 | Oreo | black | 11 | Denver | 6.00 | 39.748319373353 | -104.975901295363 | 2025-01-08 |
| 10 | Fudge | black | 12 | Boulder | NA | 40.0269539332367 | -105.274414453152 | 2025-01-08 |
| 11 | Muffin | white | NA | Colorado Springs | 7.25 | 38.8096790419363 | -104.774434092202 | 2024-12-05 |
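Under the hood, rounding to the nearest quarter is just scaling, rounding, and unscaling. A simplified base R sketch of the idea behind round_to_fraction() (the real function also validates inputs and offers further rounding of the result):

```r
x <- 6.3
round(x * 4) / 4  # 6.25, the nearest multiple of 1/4
```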
Is this something that we want to do? Or should rows with missing
values be removed? In this example we’ll fill in the missing values, but
we’ll also provide code for deleting those entries, if that were
preferred. We’ll be working with the columns
age_round_quarter_percent and
avg_weight_whole_number_s.
Let’s say the Shapiro-Wilk test fails to reject normality for both columns (p > .05), so we treat them as roughly normally distributed and decide to fill in NA values with the mean.
shapiro.test(DF_Cats$age_round_quarter_percent) #p values are > .05
##
## Shapiro-Wilk normality test
##
## data: DF_Cats$age_round_quarter_percent
## W = 0.95313, p-value = 0.7427
shapiro.test(DF_Cats$avg_weight_whole_number_s)
##
## Shapiro-Wilk normality test
##
## data: DF_Cats$avg_weight_whole_number_s
## W = 0.93444, p-value = 0.5573
#Replacing NA values in "age_round_quarter_percent" with the mean of the column, using round_to_fraction() to round mean to the nearest quarter.
DF_Cats$age_round_quarter_percent = ifelse(is.na(DF_Cats$age_round_quarter_percent),
round_to_fraction(mean(DF_Cats$age_round_quarter_percent,
na.rm = TRUE), denominator = 4),
DF_Cats$age_round_quarter_percent)
#Replacing NA values in "avg_weight_whole_number_s" with the mean of the column, using round_half_up() to get rounded up whole numbers.
DF_Cats$avg_weight_whole_number_s = ifelse(is.na(DF_Cats$avg_weight_whole_number_s),
round_half_up(mean(DF_Cats$avg_weight_whole_number_s,
na.rm = TRUE)),
DF_Cats$avg_weight_whole_number_s)
| name | color | avg_weight_whole_number_s | location | age_round_quarter_percent | lat | lon | date_surveyed_ | |
|---|---|---|---|---|---|---|---|---|
| 1 | Luna | orange | 10 | Denver | 4.75 | 39.7511528988476 | -104.997062810325 | 2025-01-14 |
| 2 | Willow | black | 9 | Fort Collins | 4.50 | 40.5498736980116 | -105.088531978709 | 2025-01-13 |
| 4 | Miso | gray | 10 | Boulder | 6.25 | 40.0174811783286 | -105.277762677534 | 2024-12-10 |
| 5 | Coco | white | 12 | Colorado Springs | 6.50 | 38.8031143844515 | -104.767835187917 | 2024-12-11 |
| 6 | Milo | gray | 11 | Denver | 3.75 | 39.7420120757529 | -104.954739780611 | 2025-01-10 |
| 7 | Bubbles | brown | 10 | Denver | 10.25 | 39.7162284226902 | -104.98374769985 | 2025-01-06 |
| 8 | Socks | brown | 8 | Boulder | 8.50 | 40.001310355816 | -105.27534451548 | 2025-01-05 |
| 9 | Oreo | black | 11 | Denver | 6.00 | 39.748319373353 | -104.975901295363 | 2025-01-08 |
| 10 | Fudge | black | 12 | Boulder | 6.50 | 40.0269539332367 | -105.274414453152 | 2025-01-08 |
| 11 | Muffin | white | 10 | Colorado Springs | 7.25 | 38.8096790419363 | -104.774434092202 | 2024-12-05 |
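The ifelse() pattern above works, but for simple mean imputation the base R replacement idiom (or tidyr::replace_na()) is more direct. A sketch on a toy vector standing in for one of our numeric columns:

```r
# Toy numeric column with one missing entry
x <- c(4.75, NA, 6.25)
# Fill the NA with the mean of the observed values
x[is.na(x)] <- mean(x, na.rm = TRUE)
x  # 4.75 5.50 6.25
# With tidyr loaded: replace_na(x, mean(x, na.rm = TRUE)) does the same
```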
If you want to remove the NA values instead, drop_na() from the
tidyr package is a simple way to remove all rows
with NA values in the specified columns.
#library(tidyr)
#if no columns are specified, rows with NA values in any column will be removed.
DF_Cats %>% drop_na()
## name color avg_weight_whole_number_s location
## 1 Luna orange 10 Denver
## 2 Willow black 9 Fort Collins
## 3 Miso gray 10 Boulder
## 4 Coco white 12 Colorado Springs
## 5 Milo gray 11 Denver
## 6 Bubbles brown 10 Denver
## 7 Socks brown 8 Boulder
## 8 Oreo black 11 Denver
## 9 Fudge black 12 Boulder
## 10 Muffin white 10 Colorado Springs
## age_round_quarter_percent lat lon date_surveyed_
## 1 4.75 39.7511528988476 -104.997062810325 2025-01-14
## 2 4.50 40.5498736980116 -105.088531978709 2025-01-13
## 3 6.25 40.0174811783286 -105.277762677534 2024-12-10
## 4 6.50 38.8031143844515 -104.767835187917 2024-12-11
## 5 3.75 39.7420120757529 -104.954739780611 2025-01-10
## 6 10.25 39.7162284226902 -104.98374769985 2025-01-06
## 7 8.50 40.001310355816 -105.27534451548 2025-01-05
## 8 6.00 39.748319373353 -104.975901295363 2025-01-08
## 9 6.50 40.0269539332367 -105.274414453152 2025-01-08
## 10 7.25 38.8096790419363 -104.774434092202 2024-12-05
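To drop rows with NA in one particular column only, pass that column to drop_na(), e.g. DF_Cats %>% drop_na(avg_weight_whole_number_s). The base R equivalent, sketched on a hypothetical frame:

```r
# Hypothetical frame where one weight is missing
df <- data.frame(name   = c("Coco", "Bubbles", "Fudge"),
                 weight = c(12, NA, 12))
# Keep only rows where weight is present
df[!is.na(df$weight), ]
```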
Let’s say you wanted to see if any cats were surveyed in the same city
and share the same hair color. This might give insights into the data,
potentially suggesting that these cats are related. With
get_dupes() you can specify the columns you want to
compare, and then get an output of the rows that share those
characteristics along with a count of those rows.
#Are Coco and Muffin related? Do they live near each other?
DF_Cats %>% get_dupes(color,location)
## color location dupe_count name avg_weight_whole_number_s
## 1 white Colorado Springs 2 Coco 12
## 2 white Colorado Springs 2 Muffin 10
## age_round_quarter_percent lat lon date_surveyed_
## 1 6.50 38.8031143844515 -104.767835187917 2024-12-11
## 2 7.25 38.8096790419363 -104.774434092202 2024-12-05
With tabyl() we can easily see how many cats of each
color are present in each city. tabyl() can
cross-tabulate up to three variables, and the results can be styled with the
**adorn_*()** family of functions.
DF_Cats %>% tabyl(color,location)
## color Boulder Colorado Springs Denver Fort Collins
## black 1 0 1 1
## brown 1 0 1 0
## gray 1 0 1 0
## orange 0 0 1 0
## white 0 2 0 0
#Compare the above result with the adorned version below
DF_Cats %>% tabyl(color,location) %>%
adorn_totals(c("row","col")) %>% #add totals to rows and columns
adorn_percentages() %>% #calculate percentages
adorn_pct_formatting(rounding = "half up", digits = 0) %>% #specify number of percent decimals and adds the % symbol
adorn_ns() %>% #include the total number of entries in parenthesis
adorn_title("combined") #Add a title for rows, columns or both combined
## color/location Boulder Colorado Springs Denver Fort Collins Total
## black 33% (1) 0% (0) 33% (1) 33% (1) 100% (3)
## brown 50% (1) 0% (0) 50% (1) 0% (0) 100% (2)
## gray 50% (1) 0% (0) 50% (1) 0% (0) 100% (2)
## orange 0% (0) 0% (0) 100% (1) 0% (0) 100% (1)
## white 0% (0) 100% (2) 0% (0) 0% (0) 100% (2)
## Total 30% (3) 20% (2) 40% (4) 10% (1) 100% (10)
summarise() is useful for computing specified summary statistics
on grouped data. In the example below we’ll calculate the weight
mean, median, standard deviation, and variance, grouping by
location.
DF_Cats %>%
group_by(location) %>%
summarise(n = n(), # Count by location
mean_weight = mean(avg_weight_whole_number_s, na.rm = TRUE), # Mean
median_weight = median(avg_weight_whole_number_s, na.rm = TRUE), # Median
sd_weight = sd(avg_weight_whole_number_s, na.rm = TRUE), # Standard deviation
var_weight = var(avg_weight_whole_number_s, na.rm = TRUE)) # Variance
## # A tibble: 4 × 6
## location n mean_weight median_weight sd_weight var_weight
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Boulder 3 10 10 2 4
## 2 Colorado Springs 2 11 11 1.41 2
## 3 Denver 4 10.5 10.5 0.577 0.333
## 4 Fort Collins 1 9 9 NA NA
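For comparison, the same grouped mean can be computed in base R with aggregate(); a sketch on a toy frame:

```r
# Toy frame with weights grouped by location
df <- data.frame(location = c("Boulder", "Boulder", "Denver"),
                 weight   = c(10, 12, 11))
aggregate(weight ~ location, data = df, FUN = mean)  # Boulder 11, Denver 11
```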
The output from skim() is slightly different from
summarise(), but it’s perhaps a little quicker and takes
less code. If no parameters are passed, skim() will
summarize the entire data set by variable class.
# library(skimr)
DF_Cats %>% skim()
| Name | Piped data |
| Number of rows | 10 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| Date | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| name | 0 | 1 | 4 | 7 | 0 | 10 | 0 |
| color | 0 | 1 | 4 | 6 | 0 | 5 | 0 |
| location | 0 | 1 | 6 | 16 | 0 | 4 | 0 |
| lat | 0 | 1 | 15 | 16 | 0 | 10 | 0 |
| lon | 0 | 1 | 16 | 17 | 0 | 10 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date_surveyed_ | 0 | 1 | 2024-12-05 | 2025-01-14 | 2025-01-07 | 9 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| avg_weight_whole_number_s | 0 | 1 | 10.30 | 1.25 | 8.00 | 10.00 | 10.00 | 11.00 | 12.00 | ▂▂▇▃▃ |
| age_round_quarter_percent | 0 | 1 | 6.42 | 1.93 | 3.75 | 5.06 | 6.38 | 7.06 | 10.25 | ▇▅▇▂▂ |
#Below we use skim() to get summary statistics of age, and weight by color
DF_Cats %>%
group_by(color) %>%
select(avg_weight_whole_number_s,age_round_quarter_percent) %>%
skim()
| Name | Piped data |
| Number of rows | 10 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | color |
Variable type: numeric
| skim_variable | color | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| avg_weight_whole_number_s | black | 0 | 1 | 10.67 | 1.53 | 9.00 | 10.00 | 11.00 | 11.50 | 12.00 | ▇▁▁▇▇ |
| avg_weight_whole_number_s | brown | 0 | 1 | 9.00 | 1.41 | 8.00 | 8.50 | 9.00 | 9.50 | 10.00 | ▇▁▁▁▇ |
| avg_weight_whole_number_s | gray | 0 | 1 | 10.50 | 0.71 | 10.00 | 10.25 | 10.50 | 10.75 | 11.00 | ▇▁▁▁▇ |
| avg_weight_whole_number_s | orange | 0 | 1 | 10.00 | NA | 10.00 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▇▁▁ |
| avg_weight_whole_number_s | white | 0 | 1 | 11.00 | 1.41 | 10.00 | 10.50 | 11.00 | 11.50 | 12.00 | ▇▁▁▁▇ |
| age_round_quarter_percent | black | 0 | 1 | 5.67 | 1.04 | 4.50 | 5.25 | 6.00 | 6.25 | 6.50 | ▇▁▁▇▇ |
| age_round_quarter_percent | brown | 0 | 1 | 9.38 | 1.24 | 8.50 | 8.94 | 9.38 | 9.81 | 10.25 | ▇▁▁▁▇ |
| age_round_quarter_percent | gray | 0 | 1 | 5.00 | 1.77 | 3.75 | 4.38 | 5.00 | 5.62 | 6.25 | ▇▁▁▁▇ |
| age_round_quarter_percent | orange | 0 | 1 | 4.75 | NA | 4.75 | 4.75 | 4.75 | 4.75 | 4.75 | ▁▁▇▁▁ |
| age_round_quarter_percent | white | 0 | 1 | 6.88 | 0.53 | 6.50 | 6.69 | 6.88 | 7.06 | 7.25 | ▇▁▁▁▇ |
I like using ggplot2 over base R or other graphics systems because of how flexible, consistent, and visually appealing it is. ggplot2 follows an underlying grammar of graphics that’s easy to pick up, and with some added packages we’ll see below, you can customize visuals even further and make them interactive.
First we create a box plot of average weight by location and assign it the name “Box_Plot”.
# library(ggplot2)
Box_Plot = ggplot(data = DF_Cats,
mapping = aes(x = location, y = avg_weight_whole_number_s)) +
labs(x = "Location",
y = "Average Weight") +
geom_boxplot() +
theme_dark()
Box_Plot
Second we create a scatter plot of average weight by age, colored by location, and assign it the name “Scatter_Plot”.
Scatter_Plot = ggplot(data = DF_Cats,
mapping = aes(age_round_quarter_percent, avg_weight_whole_number_s, colour = location)) +
geom_point(size = 3) +
geom_smooth(method = lm, color = "red", se = TRUE) +
labs(colour = "City",
x = "Age (Years)",
y = "Average Weight (Pounds)") +
theme_classic()
Scatter_Plot
plotly has a lot of great features that make your graphics interactive. Below we’ll make our box plot interactive.
Note that in a non-HTML format the plot will be static. To view the interactive HTML version, download this file and open it in a browser.
# library(plotly)
ggplotly(Box_Plot)
ggforce enhances scatter plots by letting you highlight certain points, drawing an ellipse around the points that match a given filter. A label pointing to the ellipse can also be added. Below we add the ellipse and label to the “Scatter_Plot” saved above.
# library(ggforce)
Scatter_Plot = Scatter_Plot +
geom_mark_ellipse(
data = DF_Cats %>% filter(avg_weight_whole_number_s > 10),
aes(x0 = 7.5, y0= 9, label = "These are cats heavier than 10 pounds", group = -1L),
fill = "red", colour = "black", alpha = .2, label.fontsize = 9)
Scatter_Plot
patchwork is great for combining multiple plots into one, letting you visualize or save them as a single image.
# library(patchwork)
Box_Plot / Scatter_Plot
If you have latitude and longitude coordinates in your data, you can map each entry as a point on a map and add whichever label you want.
Note that in a non-HTML format the map will be static. To view the interactive HTML version, download this file and open it in a browser.
# library(sf)
# library(mapview)
DF_Cats_Map <- st_as_sf(DF_Cats, coords = c("lon", "lat"), crs = 4326)
#spatial data must first be converted to a simple features (sf) object so it can be read and mapped appropriately
mapview(DF_Cats_Map, map.types ="CartoDB.Positron",label = DF_Cats_Map$name)